Candidate Cluster Extraction for Hierarchical Document Clustering

Authors

  • Leena H. Patil
  • Mohammad Atique
Abstract

Text documents are increasing tremendously on the internet, and hierarchical document clustering has proven useful for grouping similar documents in large applications. Still, most document clustering methods suffer from problems of high dimensionality, scalability, accuracy, and meaningful cluster labels. In this paper, a new approach, fuzzy frequent itemset-based hierarchical clustering, is proposed, in which a fuzzy association rule mining algorithm is used to improve clustering accuracy. In this approach, the key terms are first extracted from the document set, and each document is preprocessed into a document representation for the subsequent mining process. Second, a fuzzy association rule mining algorithm for text is used to discover sets of highly related fuzzy frequent itemsets, in which the key terms are regarded as the labels of the candidate clusters. The candidate clusters have been experimentally evaluated on the classic 30, Classic 4, and tr11 data sets for two methods, FIHC and K-means, in MATLAB R2009b.

Keywords: Introduction, document clustering approach, document preprocessing, candidate cluster extraction, F-measure evaluation.
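The two steps described in the abstract (turning key-term frequencies into a fuzzy representation, then mining fuzzy frequent itemsets whose terms label candidate clusters) can be illustrated with a minimal Python sketch. This is not the authors' MATLAB implementation. The membership shapes and breakpoints in `fuzzify`, the use of `min` to combine degrees, the `min_support` threshold, and the presence-based `assign` helper are all assumptions chosen for illustration.

```python
from itertools import combinations

def fuzzify(tf, low=1.0, mid=3.0, high=5.0):
    """Return (Low, Mid, High) membership degrees for a term frequency.
    The shapes and breakpoints are illustrative assumptions."""
    if tf <= 0:
        return (0.0, 0.0, 0.0)
    if tf <= low:
        return (1.0, 0.0, 0.0)
    if tf < mid:
        m = (tf - low) / (mid - low)
        return (1.0 - m, m, 0.0)
    if tf < high:
        h = (tf - mid) / (high - mid)
        return (0.0, 1.0 - h, h)
    return (0.0, 0.0, 1.0)

def candidate_clusters(docs, min_support=0.5, max_size=2):
    """docs: list of {key term: frequency}.
    Returns {itemset (tuple of terms): fuzzy support}."""
    n = len(docs)
    terms = sorted({t for d in docs for t in d})

    # For each term, keep only the fuzzy region with the largest
    # total membership over all documents (an assumed simplification).
    degree = {}
    for t in terms:
        per_doc = [fuzzify(d.get(t, 0)) for d in docs]
        totals = [sum(m[r] for m in per_doc) for r in range(3)]
        best = max(range(3), key=totals.__getitem__)
        degree[t] = [m[best] for m in per_doc]

    def support(itemset):
        # Fuzzy support: average over documents of the weakest term degree.
        return sum(min(degree[t][i] for t in itemset) for i in range(n)) / n

    frequent = {}
    level = [(t,) for t in terms]
    for size in range(1, max_size + 1):
        kept = [s for s in level if support(s) >= min_support]
        frequent.update({s: support(s) for s in kept})
        # Build larger candidates only from terms that are still frequent.
        items = sorted({t for s in kept for t in s})
        level = list(combinations(items, size + 1))
    return frequent

def assign(docs, frequent):
    """Hypothetical helper: a document joins a candidate cluster
    when it contains every term of the cluster's label."""
    return {s: [i for i, d in enumerate(docs)
                if all(d.get(t, 0) > 0 for t in s)]
            for s in frequent}
```

Each frequent itemset acts as the label of one candidate cluster, so a document can belong to several candidate clusters at once. How those overlaps are resolved into a hierarchy is outside the scope of this sketch.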

Similar resources

Fuzzy Association Rule Mining Algorithm to Generate Candidate Cluster: An Approach to Hierarchical Document Clustering

As text documents are increasing rapidly on the internet, the process of grouping similar documents for versatile applications has drawn the attention of researchers in this area. However, most clustering methods struggle with problems of high dimensionality, scalability, accuracy, and meaningful cluster labels. Hierarchical clustering is one solution to these problems. Proper clustering se...

Hierarchical and Partitioning Algorithm for Document Clustring: a Survey

Document clustering is a widely researched area because a large amount of rich and dynamic information is available on the World Wide Web. It is the application of cluster analysis to textual documents. Applications of document clustering include automatic document organization, data mining, topic extraction, and filtering or fast information retrieval. The purpose of this su...

Applying Formal Concept Analysis to Teaching Material Extraction

A text summarization system can save users time when reading a large number of documents. The summary produced by such a system is usually composed of meaningful sentences that represent the content of the text. The relations between keywords usually come from their co-occurrences in a document. This study uses a hierarchical clustering method to cluster sentences and applies concept formal analysis to find o...

Document Classification and Visualisation to Support the Investigation of Suspected Fraud

This position paper reports on ongoing work where three clustering and visualisation techniques for large document collections – developed at the Joint Research Centre (JRC) – are applied to textual data to support the European Commission’s investigation on suspected fraud cases. The techniques are (a) an implementation of the neural network application WEBSOM, (b) hierarchical cluster analysis...

Assessment of the Performance of Clustering Algorithms in the Extraction of Similar Trajectories

In recent years, the rapid growth of spatial trajectory data and the need to process it and extract useful information and meaningful patterns have attracted many researchers to the field of spatio-temporal trajectory clustering. The processing and analysis of these trajectories have resulted in the extraction of useful information whic...

Publication date: 2011